Automatic Extraction of Linguistic Data from Digitized Documents

نویسندگان

TERRENCE SZYMANSKI

Terrence Szymanski

چکیده

This paper presents a system for automatically extracting linguistic data from digitized linguistic documents using a combination of existing software packages and custom scripts. The system is designed to leverage existing resources in online digital libraries in order to bootstrap the creation of large, multi-lingual linguistic corpora, which can then be used to conduct data-driven experimental research into cross-linguistic or universal linguistic phenomena. The system identifies instances of foreign-language text accompanied by reference-language translations within the text of printed books that have been scanned into digital format, and extracts these to produce a parallel corpus of example sentences. While the system achieves a high precision on predicting foreign text, its accuracy overall is low, and directions for improvement and future work are identified.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Improvement of Speech Summarization Using Prosodic Information

Speech summarization is a technique of extracting important sentences from spoken documents. It provides us useful information to looking for the spoken documents that we want. Spoken documents contain non-linguistic information, which is mainly expressed by prosody, while written text conveys only linguistic information. This paper describes a summarization method which uses prosodic informati...

متن کامل

Linguistic Annotation for the Semantic Web

Establishing the semantic web on a large scale implies the widespread annotation of web documents with ontology-based knowledge markup. For this purpose, tools have been developed that allow for semi-automatic annotation of web documents with ontology-based metadata. However, given that a large number of web documents consist either fully or at least partially of free text, language technology ...

متن کامل

Automatic Enhanced Update Summary Generation System for News Documents

Fast changing knowledge systems on the Internet can be accessed more efficiently with the help of automatic document summarization and updating techniques. The aim of multi-document update summary generation is to construct a summary unfolding the mainstream of data from a collection of documents based on the hypothesis that the user has already read a set of previous documents. In order to pro...

متن کامل

Evaluating anaphora and coreference resolution to improve automatic keyphrase extraction

In this paper we analyze the effectiveness of using linguistic knowledge from coreference and anaphora resolution for improving the performance for supervised keyphrase extraction. In order to verify the impact of these features, we define a baseline keyphrase extraction system and evaluate its performance on a standard dataset using different machine learning algorithms. Then, we consider new ...

متن کامل

Metadata Extraction from Bibliographic Documents for Digital Library

This chapter addresses the problem of automatic metadata extraction within digitized documents by retro-conversion techniques. The focus is on bibliographic documents as they are by nature a source of such metadata. They are strongly structuring for a digital library (DL), their automatic recognition presents an obvious interest. However as their origin is very different (references, citations,...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2013

Automatic Extraction of Linguistic Data from Digitized Documents

نویسندگان

چکیده

منابع مشابه

Improvement of Speech Summarization Using Prosodic Information

Linguistic Annotation for the Semantic Web

Automatic Enhanced Update Summary Generation System for News Documents

Evaluating anaphora and coreference resolution to improve automatic keyphrase extraction

Metadata Extraction from Bibliographic Documents for Digital Library

عنوان ژورنال:

اشتراک گذاری